Broadcast analysis

First of all, let's import the datasets I obtained from ORFEO:

bcast_default.df<- read.csv(file = "bcast_default.csv")
bcast_linear.df<- read.csv(file = "bcast_linear.csv")
bcast_chain.df<- read.csv(file = "bcast_chain.csv")
bcast_binarytree.df<- read.csv(file = "bcast_binarytree.csv")

Now we can bind them into a single df:

bcast.df<- rbind(bcast_default.df,bcast_linear.df,bcast_chain.df,bcast_binarytree.df)
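If the four CSVs were ever to drift apart in their columns, `dplyr::bind_rows()` would be a safer alternative to `rbind()`, since it matches columns by name and fills missing ones with `NA`. A minimal sketch on toy data frames (the toy values are made up for illustration):

```r
library(dplyr)

# Toy stand-ins for two of the four imported data frames
df_a <- data.frame(Processes = c(2, 4), Latency = c(1.1, 1.5), Algorithm = "Linear")
df_b <- data.frame(Processes = c(2, 4), Latency = c(1.0, 1.3), Algorithm = "Chain")

# bind_rows() matches columns by name and fills any missing column with NA,
# so it is more forgiving than rbind() if the CSV schemas ever diverge
combined <- bind_rows(df_a, df_b)
nrow(combined)  # 4
```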

Fix algorithm and see how latency changes when allocation varies

Bcast linear

For the moment, let's fix the message size to 1 MPI_CHAR

linear_map_plt<- bcast_linear.df %>% filter(MessageSize == 1) %>%
  ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
    geom_line(linetype = 1) +
    geom_point(size=1)+
    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
    scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
    xlim(2, 48) +  # Adjust the x-axis limits as needed
    labs(x = 'N. of processes (#cores)',
         y = 'Latency (μs)',
         title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
         color = 'Allocation')+
    theme_light()+
    theme(
      legend.position = "bottom",
      legend.text = element_text(size = 12) 
    )
linear_map_plt

Comments:

  • As expected, both allocation by socket and by core reveal some “jumps” when changing node, while allocation by node proceeds “linearly”
  • Regarding the jump when changing socket, allocation by core reveals a jump, while allocation by socket proceeds linearly
  • Something I cannot yet explain 🥲 is the position of the jumps… Why at 30 and not at 24? Isac suggested it could come from some unusual settings of external variables; if we were on EPYC we could suspect something about NUMA regions, but THIN nodes have just 1 single NUMA domain, so that is not our case.
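One way to pin down where the jumps occur is to flag the process counts at which latency increases sharply relative to the previous point. A hedged sketch on synthetic data (the threshold and the toy latency curve are assumptions, not benchmark values):

```r
# Synthetic latency curve with a deliberate jump at 30 processes
procs <- 2:48
lat <- 1 + 0.05 * procs + ifelse(procs >= 30, 5, 0)

# Relative increase between consecutive process counts
rel_jump <- diff(lat) / head(lat, -1)

# Process counts where latency grows by more than 50% in one step
jump_at <- procs[-1][rel_jump > 0.5]
jump_at  # 30
```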

Bcast chain

As we did above for linear broadcast, we fix the message size to 1 MPI_CHAR

chain_map_plt<- bcast_chain.df %>% filter(MessageSize == 1) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
  geom_line(linetype = 1) +
  geom_point(size=1)+
  geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
  annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
  scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
  xlim(2, 48) +  # Adjust the x-axis limits as needed
  labs(x = 'N. of processes (#cores)',
       y = 'Latency (μs)',
       title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
       color = 'Allocation')+
  theme_light()+
  theme(
    legend.position = "bottom",
      legend.text = element_text(size = 12) 
  )
chain_map_plt

Comments:

  • Same considerations as before.
  • We can notice a huge difference w.r.t. linear when we change node. For linear, when np >= 24 the 3 different allocations seem to converge, and it makes sense: linear sends from process 0 to all the others, so once we have processes in both nodes the same steps are executed independently of the allocation. However, this is not the case for the chain algorithm, where a precise order is followed: if we distribute processes by node, the message paths will obviously be longer than when allocating by core or socket, and this is evident in the plot. Differences between allocation by core and by socket, instead, are not so evident (probably the gap between communication among cores in the same socket and cores in the same node is not that large… maybe working on just 1 NUMA region is beneficial?!)
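The convergence claim can be quantified: per algorithm, compute the spread of mean latencies across allocations once np >= 24. A sketch on toy data (the toy latencies and the column layout are assumptions mirroring the real data frame):

```r
library(dplyr)

# Toy data: linear converges across allocations, chain does not
toy <- expand.grid(Allocation = c("core", "socket", "node"),
                   Processes  = c(24, 32, 48))
toy$Algorithm <- "Linear"
toy$Latency <- 10 + 0.1 * toy$Processes            # identical across allocations
chain <- toy
chain$Algorithm <- "Chain"
chain$Latency <- chain$Latency + ifelse(chain$Allocation == "node", 8, 0)

# Range of per-allocation mean latencies: ~0 means the curves converge
spread <- bind_rows(toy, chain) %>%
  filter(Processes >= 24) %>%
  group_by(Algorithm, Allocation) %>%
  summarise(mean_lat = mean(Latency), .groups = "drop_last") %>%
  summarise(range_across_alloc = max(mean_lat) - min(mean_lat), .groups = "drop")

spread  # Linear ~0 (converged), Chain ~8 (node allocation stands apart)
```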

Bcast BinaryTree

binarytree_map_plt<- bcast_binarytree.df %>% filter(MessageSize == 1) %>%
 ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
  geom_line(linetype = 1) +
  geom_point(size=1)+
  geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
  annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
  scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
  xlim(2, 48) +  # Adjust the x-axis limits as needed
  labs(x = 'N. of processes (#cores)',
       y = 'Latency (μs)',
       title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
       color = 'Allocation')+
  theme_light()+
  theme(
    legend.position = "bottom",
      legend.text = element_text(size = 12) 
  )
binarytree_map_plt

  • Here we should understand why allocation by node shows such behavior…
bcast_default.df %>% filter(MessageSize == 1) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
  geom_line(linetype = 1) +
  geom_point(size=1)+
  geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
  annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
  scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
  xlim(2, 48) +  # Adjust the x-axis limits as needed
  labs(x = 'N. of processes (#cores)',
       y = 'Latency',
       title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
       color = 'Allocation')+
  theme_light()

Fix map allocation and compare different algorithms

Comparing Linear vs Chain

As for now, let’s focus on the comparison between Linear vs Chain

bcast.df %>% filter(MessageSize == 1)  %>% filter(Allocation == "core") %>% filter(Algorithm %in% c("Linear","Chain"))  %>%
  ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
    geom_line(linetype = 1) +
    geom_point(size=1)+
    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
    scale_color_manual(values = c("#E67E22","#3498DB")) +
    xlim(2, 48) +  # Adjust the x-axis limits as needed
    labs(x = 'N. of processes (#cores)',
         y = 'Latency',
         title = 'Broadcast Latency vs #cores (MessageSize = 1)',
         color = 'Algorithm')+
    theme_light()

Comments:

Remember that we are dealing with --map-by core

  • Region 1 (within socket): chain and linear perform ~ the same –> this reveals that probably intra-socket communication (between cores in the same socket) is so fast that the algorithm implementation does not impact latency significantly (or the difference is imperceptible given the y-axis scale of the entire plot)

    NB: here it might be interesting to show point-to-point latency (since we are working with a short message, we are considering “pure” latency). From this observation we can then move to the plots where I increase the message size to show the difference in algorithm performance

  • Region 2 (within node): chain and linear still perform ~ the same –> maybe intra-node communication (between cores, even if they are in different sockets) is also fast enough that the difference is not so evident. Anyway, there is a small gap that shows chain having lower latency. Reasoning on the implementation of chain, this makes sense: both because chain works with contiguous cores (while flat has more “core-jumps”), and because the paper states that MPI chain sends the message by dividing the data into chunks (while flat is not able to do that)

  • Region 3 (outside node): obviously in this region the latency is significantly higher, and the difference between the chain and linear algorithms is more evident. As explained before, this makes sense: chain works with contiguous cores, while flat tree sends from rank 0 to all the other ranks!

BONUS: in the plot above the difference between the algorithms is not so evident, but this might derive from the small message size. Even if flat tree is less efficient than chain, the fact that we are sending a small message (1 MPI_CHAR) makes both algorithms super fast.

Recall that the choice of considering a small message first was to exclude as much as possible the time needed for sending/storing the message and to focus as much as possible on the “pure” latency.

BUT, if we now increase the size of the message, the different performances of the 2 algorithms will be more evident:

Plots=list()
sizes=c(64,256,2048)
for(i in seq_along(sizes)){
  # Filter by current size
  current_plot<- bcast.df %>% filter(Allocation == "core") %>% filter(Algorithm %in% c("Linear","Chain"))  %>% filter(MessageSize == sizes[i]) %>%
                  ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
                          geom_line(linetype = 1) +
                          geom_point(size=1)+
                          geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
                          annotate("text",x=c(12,24),y=c(0,0),label=c("12 Cores","24 cores"),vjust=0,hjust=-0.1,color="#616A6B")+
                          scale_color_manual(values = c("#E67E22","#3498DB")) +
                          xlim(2,48) +  # Adjust the x-axis limits as needed
                          labs(x = 'N. of processes (#cores)',
                               y = expression('Latency (µs)'),
                               title = paste0("Broadcast Latency vs #cores (MessageSize = ", sizes[i], ")"),
                               color = 'Algorithm')+
                          theme_light()
  Plots[[i]]<- current_plot
}

Plots[[1]] | Plots[[2]] | Plots[[3]]

NB: the message-size effect can be commented by mentioning why chain is more performant than linear: chain allows for message partitioning (and linear does NOT!). We could possibly cite the reference here.
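The intuition can be made concrete with a simple Hockney-style cost sketch: a flat (linear) broadcast costs roughly \((P-1)(\alpha + \beta m)\) at the root, while a pipelined chain with segment size \(s\) costs roughly \((P-2 + m/s)(\alpha + \beta s)\). The values of alpha, beta, and the segment size below are illustrative assumptions, not measured machine constants:

```r
# Hockney-style cost sketch; alpha, beta and the segment size s are
# illustrative assumptions, not measured constants
alpha <- 1      # per-message startup latency (us)
beta  <- 0.01   # per-byte transfer time (us/byte)

# Flat (linear) tree: root serially sends the full message to each of P-1 ranks
flat_cost <- function(P, m) (P - 1) * (alpha + beta * m)

# Chain with pipelining: m/s segments flow down the chain of P-1 hops
chain_cost <- function(P, m, s = 512) {
  s <- min(s, m)                      # a segment cannot exceed the message
  (P - 2 + m / s) * (alpha + beta * s)
}

flat_cost(48, 1)       # small message: the two costs coincide
chain_cost(48, 1)
flat_cost(48, 16384)   # large message: chain is far cheaper
chain_cost(48, 16384)
```

This matches the plots: for 1 MPI_CHAR the pipelining brings no benefit, while at larger sizes the chunked chain amortizes the startup cost over the pipeline.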

Comparing all bcast algorithms

Having noticed that allocation by core is associated with the lowest latency, let's fix Allocation to core and compare the algorithms behind the MPI broadcast operation.

Considering a message size of 1 MPI_CHAR we obtain:

bcast.df %>% filter(MessageSize == 1)  %>% filter(Allocation == "core") %>%
  ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
    geom_line(linetype = 1) +
    geom_point(size=1)+
    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
    scale_color_manual(values = c("#27AE60","#E67E22","#8E44AD","#3498DB")) +
    xlim(2, 48) +  # Adjust the x-axis limits as needed
    labs(x = 'N. of processes (#cores)',
         y = 'Latency',
         title = 'Broadcast Latency vs #cores (MessageSize = 1)',
         color = 'Algorithm')+
    theme_light()

Instead, if we let the size vary we get something similar to what we saw in Linear vs Chain:

Plots=list()
sizes=c(64,256,2048)
for(i in seq_along(sizes)){
  # Filter by current size
  current_plot<- bcast.df %>% filter(Allocation == "core") %>% filter(MessageSize == sizes[i]) %>%
                    ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
                    geom_line(linetype = 1) +
                    geom_point(size=1)+
                    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
                    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
                    scale_color_manual(values = c("#27AE60","#E67E22","#8E44AD","#3498DB")) +
                    xlim(2, 48) +  # Adjust the x-axis limits as needed
                    labs(x = 'N. of processes (#cores)',
                         y = expression('Latency (µs)'),
                         title = paste0("Broadcast Latency vs #cores (MessageSize = ", sizes[i], ")"),
                         color = 'Algorithm')+
                    theme_light()
  Plots[[i]]<- current_plot
}

Plots[[1]] | Plots[[2]] | Plots[[3]]

Fix algorithm and allocation and see how latency changes when message size varies

Linear

bcast_linear.df %>%
  filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,1024,2048,16384)) %>%
  ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line(linetype=1) +
  geom_point() +
  labs(
    x = "Processes",
    y = "Latency",
    title = "Latency vs Processes for Linear algorithm (Allocation='core')",
    color = "Message Size"
  ) +
  xlim(c(2,48))+
  theme_light()

Note: I did NOT consider all the message sizes, since latency grows roughly exponentially with message size. We will treat this more carefully when building a model…
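Since latency spans orders of magnitude across message sizes, a log2 y-axis would let all sizes share one readable panel instead of dropping the larger ones. A sketch on toy data (the latency values are made up for illustration):

```r
library(ggplot2)

# Toy latencies spanning several orders of magnitude across message sizes
toy <- expand.grid(Processes   = c(2, 4, 8, 16, 32, 48),
                   MessageSize = c(2, 256, 2048, 16384))
toy$Latency <- 0.5 * toy$Processes * (1 + 0.001 * toy$MessageSize)

p <- ggplot(toy, aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line() +
  geom_point(size = 1) +
  scale_y_continuous(trans = "log2") +   # compresses the wide latency range
  labs(color = "Message Size") +
  theme_light()
```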

Chain

bcast_chain.df %>%
  filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,2048,4096,8192)) %>%
  ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line(linetype=1) +
  geom_point(size=1) +
  labs(
    x = "Processes",
    y = "Latency",
    title = "Latency vs Processes for Chain algorithm (Allocation='core')",
    color = "Message Size"
  ) +
  xlim(c(2,48))+
  theme_light()

Binary tree

bcast_binarytree.df %>%
  filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,2048,4096,8192)) %>%
  ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line(linetype=1) +
  geom_point(size=1) +
  labs(
    x = "Processes",
    y = "Latency",
    title = "Latency vs Processes for BinaryTree algorithm (Allocation='core')",
    color = "Message Size"
  ) +
  xlim(c(2,48))+
  theme_light()

Performance Models

Now we fix the allocation to core, as it was the most performant allocation for all the algorithms. Our aim is to fit a performance model able to predict latency based on the data.

Disclaimer: this may be biased by my statistical background, but I was not able to implement the latency models explained in the paper…

Model for linear alg

First of all, let’s plot the variables to spot some relations:

bcast_linear.df %>% 
  filter(Allocation == "core") %>%
  plot_ly(
    x = ~MessageSize,
    y = ~Processes,
    z = ~Latency
  )%>%
  add_markers(
    marker = list(color = ~Latency,
                  line = list(color = "black",width = 0.8 ))
  ) %>%
  layout(
    scene = list(
      xaxis = list(title = "MessageSize"),
      yaxis = list(title = "Processes"),
      zaxis = list(title = "Latency")
    ),
    title = "Latency vs Processes and MessageSize for Linear algorithm (Allocation='core')"
  )

We can implement a simple linear model (it makes sense to remove the intercept, since latency should vanish as the predictors go to zero):

fit.linear<- lm(formula= Latency ~ -1+MessageSize+Processes,
                data=bcast_linear.df %>% filter(Allocation == "core")
                )
summary(fit.linear)
## 
## Call:
## lm(formula = Latency ~ -1 + MessageSize + Processes, data = bcast_linear.df %>% 
##     filter(Allocation == "core"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1133.70   -32.02   -20.19    -9.89   650.40 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## MessageSize 1.136e-03  2.454e-05  46.294  < 2e-16 ***
## Processes   1.083e+00  2.269e-01   4.775 2.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 137.4 on 502 degrees of freedom
## Multiple R-squared:  0.838,  Adjusted R-squared:  0.8373 
## F-statistic:  1298 on 2 and 502 DF,  p-value: < 2.2e-16

Let's take a marginal look at our response variable:

hist(bcast_linear.df$Latency,breaks = 70,
     xlab="Latency(us)",
     main="Histogram of Latency (Linear Broadcast)")

EVIDENT SKEWNESS! Let's try applying a log2 transformation (in the model I will apply it to both Latency and MessageSize):

hist(log2(bcast_linear.df$Latency),breaks = 40,
     xlab="log2(Latency(us))",
     main="Histogram of log2 Latency (Linear Broadcast)")
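The skewness reduction can also be quantified. The helper below is a simple moment-based skewness estimate (written out to avoid an extra package dependency), demonstrated on synthetic log-normal data standing in for the latencies:

```r
# Simple moment-based sample skewness: third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
lat <- 2^rnorm(500, mean = 5, sd = 2)  # synthetic, heavily right-skewed

skewness(lat)        # large and positive
skewness(log2(lat))  # near zero after the transform
```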

bcast_linear.df %>% 
  filter(Allocation == "core") %>%
  plot_ly(
    x = ~log2(MessageSize),
    y = ~Processes,
    z = ~log2(Latency)
  )%>%
  add_markers(
    marker = list(color = ~log2(Latency),
                  line = list(color = "black",width = 0.8 ))
  ) %>%
  layout(
    scene = list(
      xaxis = list(title = "log2(MessageSize)"),
      yaxis = list(title = "Processes"),
      zaxis = list(title = "log2(Latency)")
    ),
    title = "log2(Latency) vs Processes and log2(MessageSize) for Linear algorithm (Allocation='core')"
  )

The linear model becomes:

fit.linear.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
                data=bcast_linear.df %>% filter(Allocation == "core")
                )
summary(fit.linear.log)
## 
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize), 
##     data = bcast_linear.df %>% filter(Allocation == "core"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4898 -1.1118 -0.5450  0.3623  3.6810 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## Processes         0.029815   0.003808    7.83 2.92e-14 ***
## log2(MessageSize) 0.321259   0.009308   34.51  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.62 on 502 degrees of freedom
## Multiple R-squared:  0.8824, Adjusted R-squared:  0.8819 
## F-statistic:  1883 on 2 and 502 DF,  p-value: < 2.2e-16

Comment: \(R^2_{adj}\) increases from 83.7% to 88.2%!
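On the log2 scale the fitted model is multiplicative after back-transformation: \(\log_2(\text{Latency}) = b_1 P + b_2 \log_2(m)\) implies \(\text{Latency} = 2^{b_1 P} \cdot m^{b_2}\), so each extra process multiplies latency by \(2^{b_1}\) and latency scales as a power \(b_2\) of message size. A sketch of the back-transformation, taking the coefficients reported above as plain numbers:

```r
# Coefficients from summary(fit.linear.log) above
b1 <- 0.029815  # Processes
b2 <- 0.321259  # log2(MessageSize)

# Back-transformed prediction: Latency = 2^(b1*P) * m^b2
predict_latency <- function(P, m) 2^(b1 * P) * m^b2

# Multiplicative effect of one extra process
2^b1  # ~1.021 -> each added process costs ~2.1% more latency
predict_latency(48, 1024)
```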

Model for chain alg

bcast_chain.df %>% 
  filter(Allocation == "core") %>%
  plot_ly(
    x = ~log2(MessageSize),
    y = ~Processes,
    z = ~log2(Latency)
  )%>%
  add_markers(
    marker = list(color = ~log2(Latency),
                  line = list(color = "black",width = 0.8 ))
  ) %>%
  layout(
    scene = list(
      xaxis = list(title = "log2(MessageSize)"),
      yaxis = list(title = "Processes"),
      zaxis = list(title = "log2(Latency)")
    ),
    title = "log2(Latency) vs Processes and log2(MessageSize) for Chain algorithm (Allocation='core')"
  )
fit.chain.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
                data=bcast_chain.df %>% filter(Allocation == "core")
                )
summary(fit.chain.log)
## 
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize), 
##     data = bcast_chain.df %>% filter(Allocation == "core"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3286 -1.3089 -0.5489  0.6052  3.1535 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## Processes         0.021957   0.003739   5.873  7.8e-09 ***
## log2(MessageSize) 0.306713   0.009139  33.560  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.591 on 502 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8677 
## F-statistic:  1654 on 2 and 502 DF,  p-value: < 2.2e-16

Model for binarytree alg

bcast_binarytree.df %>% 
  filter(Allocation == "core") %>%
  plot_ly(
    x = ~log2(MessageSize),
    y = ~Processes,
    z = ~log2(Latency)
  )%>%
  add_markers(
    marker = list(color = ~log2(Latency),
                  line = list(color = "black",width = 0.8 ))
  ) %>%
  layout(
    scene = list(
      xaxis = list(title = "log2(MessageSize)"),
      yaxis = list(title = "Processes"),
      zaxis = list(title = "log2(Latency)")
    ),
    title = "log2(Latency) vs Processes and log2(MessageSize) for BinaryTree algorithm (Allocation='core')"
  )
fit.binarytree.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
                data=bcast_binarytree.df %>% filter(Allocation == "core")
                )
summary(fit.binarytree.log)
## 
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize), 
##     data = bcast_binarytree.df %>% filter(Allocation == "core"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4247 -1.3128 -0.6351  0.5242  3.6288 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## Processes         0.016315   0.003766   4.332 1.79e-05 ***
## log2(MessageSize) 0.317449   0.009207  34.480  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.603 on 502 degrees of freedom
## Multiple R-squared:  0.8663, Adjusted R-squared:  0.8658 
## F-statistic:  1626 on 2 and 502 DF,  p-value: < 2.2e-16

Model for default alg

fit.default.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
                data=bcast_default.df %>% filter(Allocation == "core")
                )
summary(fit.default.log)
## 
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize), 
##     data = bcast_default.df %>% filter(Allocation == "core"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4849 -1.2933 -0.5753  0.5509  3.2011 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## Processes         0.009586   0.003781   2.535   0.0115 *  
## log2(MessageSize) 0.321418   0.009242  34.778   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.609 on 502 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8586 
## F-statistic:  1532 on 2 and 502 DF,  p-value: < 2.2e-16
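To compare the four fitted models at a glance, a small helper can tabulate coefficients and adjusted \(R^2\) from any `lm` object. It is shown on a toy fit so the sketch is self-contained; in the analysis it would be applied to `fit.linear.log`, `fit.chain.log`, `fit.binarytree.log`, and `fit.default.log`:

```r
# Summarise an lm fit into one row: coefficients + adjusted R^2
fit_row <- function(fit, name) {
  s <- summary(fit)
  data.frame(model = name, t(coef(fit)), adj_r2 = s$adj.r.squared)
}

# Toy fit standing in for the broadcast models
toy <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
fit_toy <- lm(y ~ -1 + x, data = toy)

fit_row(fit_toy, "toy")
# In the analysis: rbind the rows produced from the four fitted models
```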